RAG (Retrieval-Augmented Generation) in a nutshell

RAG (Retrieval-Augmented Generation) is a technique for feeding external, up-to-date data into the "black box" of a Large Language Model (LLM) without having to retrain it.

RAG is best understood as a two-phase architecture:

  • Offline indexing (data storage)
  • Online retrieval (the chatbot in operation)

Phase 1: The indexing process (data storage)
This is the crucial "offline" step in which knowledge is prepared. The aim is to bring unstructured data (PDFs, docs, web pages) into a format that enables a semantic search (meaning-based search) instead of a pure keyword search.

1. loading (data loading)
First, the raw data must be loaded. There are specialized data loaders for this (e.g. from frameworks such as LlamaIndex or LangChain).

  • PDFLoader reads text from PDFs.
  • WebBaseLoader scrapes a web page.
  • DirectoryLoader loads all files from a folder.
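
For illustration, a minimal loading sketch using LangChain's community loaders (the file name, URL and folder are placeholders; exact class names differ slightly between frameworks and versions, and DirectoryLoader may need an extra parser package such as unstructured):

from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    WebBaseLoader,
)

# Read a PDF (typically one Document per page)
pdf_docs = PyPDFLoader("manual.pdf").load()

# Scrape a single web page
web_docs = WebBaseLoader("https://example.com/faq").load()

# Load every Markdown file in a folder
dir_docs = DirectoryLoader("./knowledge_base", glob="**/*.md").load()

docs = pdf_docs + web_docs + dir_docs   # list of Document objects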

2. segmenting (chunking)
You cannot process a 100-page PDF in one go. The data must be broken down into manageable pieces, so-called "chunks".

Why?

  • Embedding quality: Embedding models (see next point) work best with short, coherent chunks of text.
  • Context limit: The LLM has a limited "context window" (e.g. 4k, 16k or 128k tokens). The retrieved context must fit into it.

Methods:

  • Fixed-size chunking: Simple (e.g. "always cut after 1000 characters"). Fast, but "stupid", as it can tear sentences apart.
  • Recursive Character Text Splitter: A more advanced method that attempts to split at logical points (paragraphs, sentences, words) - see the sketch after this list.
  • Semantic Chunking: (Advanced) Uses an embedding model to recognize where a semantic topic change occurs and cuts there.
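
A minimal chunking sketch, again assuming LangChain (the chunk_size and chunk_overlap values are only illustrative starting points):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Tries paragraphs first, then sentences, then words, and only falls back
# to a hard cut when a piece still exceeds chunk_size characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # target length per chunk (in characters)
    chunk_overlap=200,   # overlap so thoughts are not cut off at the edges
)
chunks = splitter.split_documents(docs)   # 'docs' from the loading sketch above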


3. vectorizing (embedding)
This is the heart of "data storage". Each text chunk is now converted into a numerical vector (an "embedding").

  • What is this? A vector is a long list of numbers (e.g. 384, 768 or 1536 dimensions) that represents the semantic meaning of the text chunk.
  • How? By means of an embedding model (e.g. text-embedding-ada-002 from OpenAI, all-MiniLM-L6-v2 from Sentence-Transformers or the newer text-embedding-3-small).
  • The result: Texts with similar meaning (e.g. "How much are the shipping costs?" and "How much does the delivery cost?") have vectors that are "close" to each other in the high-dimensional vector space.
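
A small sketch with the Sentence-Transformers model mentioned above, to make this "closeness" concrete (the third sentence is only there as an unrelated counterexample):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional vectors

sentences = [
    "How much are the shipping costs?",
    "How much does the delivery cost?",
    "The sky is blue today.",
]
embeddings = model.encode(sentences)

# Cosine similarity: the two shipping questions land close together,
# the unrelated sentence scores clearly lower.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))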

4. storing (storage & indexing)
Now the actual vector database comes into play.

What is stored? At least three things are stored for each chunk:

  • Vector (the list of numbers)
  • Original text (the chunk itself)
  • Metadata (e.g. source: 'document_A.pdf', page: 42, chapter: 'Security')

Why a special DB? A normal SQL DB is very bad at finding similarity. It can find WHERE text LIKE '%delivery%', but not "find text that means something like 'delivery'".

  • The technology: Vector databases (like Pinecone, Weaviate, Chroma, Milvus, or pgvector as a Postgres extension) are optimized for a single task: Approximate Nearest Neighbor (ANN) Search.
  • The index: The DB builds a special index (e.g. HNSW - Hierarchical Navigable Small World) that makes it possible to find the "nearest neighbors" (the most similar vectors) of a new vector extremely quickly (in milliseconds), without comparing it against every single vector in the DB (which would be an O(n) brute-force scan).
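
A storage sketch with Chroma, reusing model and chunks from the sketches above (Chroma builds an approximate-nearest-neighbor index under the hood; other vector DBs have very similar APIs):

import chromadb

client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(name="knowledge_base")

# Store vector, original text and metadata side by side for every chunk.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=[model.encode(c.page_content).tolist() for c in chunks],
    documents=[c.page_content for c in chunks],
    metadatas=[{"source": c.metadata.get("source", "unknown")} for c in chunks],
)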

Summary of data storage: The knowledge base is no longer plain text, but a highly optimized, searchable database of semantic vectors pointing to their original texts.

Phase 2: The retrieval process (the chatbot in operation)

The following processes take place when a user asks a question.

1. query embedding
The user types in a question: "What are the delivery times?"

This question (the "query") is run through the same embedding model that was used for the documents in phase 1.

Result: A vector that represents the meaning of the user question.
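
In code this is a single call - crucially with the same model object (or at least the same model name) as in phase 1:

# Same embedding model as during indexing - mixing models breaks retrieval.
query = "What are the delivery times?"
query_vector = model.encode(query)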

2. vector search (the "retrieval")
This query vector is now sent to the vector database.

  • The DB uses its HNSW index to find the top-k (e.g. k=5) most similar vectors from your index.
  • Search type: The "proximity" is usually calculated using cosine similarity or Euclidean distance.
  • Filtering (optional): This is where the metadata becomes important. If the user asks: "What does document A say about delivery times?", you can filter the search as follows: "Find top-k vectors WHERE source == 'document_A.pdf'".
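
Continuing the Chroma sketch, the search (including the optional metadata filter) looks roughly like this:

# Top-k semantic search, optionally restricted by metadata.
results = collection.query(
    query_embeddings=[query_vector.tolist()],
    n_results=5,                             # k = 5
    where={"source": "document_A.pdf"},      # optional metadata filter
)

for text, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(round(dist, 3), meta["source"], text[:80])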

3. augmentation (the "enrichment")
The database returns the top-k results. These include not only the vectors, but also the corresponding original text chunks (which were stored alongside them in phase 1, step 4). These chunks (the "context") are now automatically placed in front of the original user question in a new, larger prompt.

Example prompt for the LLM:
You are a helpful assistant. Answer the user's question
based solely on the following context:

--- CONTEXT START ---
[Text-Chunk 3 - from doc_A.pdf, page 5]
"Our standard delivery time is 3-5 working days.
Express deliveries are made within 24 hours."

[Text-Chunk 1 - from faq.html]
"Orders placed before 12 noon are processed the same day.
The shipping costs are a flat rate of €4.99."

[Text-Chunk 42 - from agb.pdf, page 12]
"Delays due to force majeure are excluded..."
--- CONTEXT END ---

Question from the user: What are the delivery times?

Answer:

4. generation (the "generation")

  • This entire "augmented prompt" is sent to an LLM (e.g. GPT-4, Llama 3 or Claude 3).
  • The LLM no longer needs to rely on its own "world knowledge" or guess (hallucinate); the answer is right there in the context.
  • The LLM generates the answer: "The standard delivery time is 3-5 working days. Express deliveries are possible within 24 hours."
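
Steps 3 and 4 combined in a sketch with the OpenAI Python SDK (the model name is just a placeholder - any chat-capable LLM, including a local Llama, works the same way):

from openai import OpenAI

# Augmentation: glue the retrieved chunks in front of the user question.
context = "\n\n".join(results["documents"][0])
prompt = (
    "You are a helpful assistant. Answer the user's question "
    "based solely on the following context:\n\n"
    f"--- CONTEXT START ---\n{context}\n--- CONTEXT END ---\n\n"
    f"Question from the user: {query}\n\nAnswer:"
)

# Generation: send the augmented prompt to the LLM.
llm = OpenAI()   # expects OPENAI_API_KEY in the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)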

Important considerations

  • Chunking strategy is crucial: chunks that are too small lack context; chunks that are too large "dilute" the embedding and add too much "noise" (irrelevant information) to the LLM prompt.
  • Hybrid search: Pure vector search struggles with specific proper names, product SKUs or IDs (e.g. "Error_404_Fix"). Here, vector search (semantic) is combined with a traditional keyword search (e.g. BM25); a fusion sketch follows after this list.
  • Re-ranking (optional): If the search results are not good enough, a re-ranking step is added: the top k=50 results are retrieved first (fast, "rough"), and a smaller, specialized re-ranking model (e.g. Cohere Rerank) then sorts those 50 down to the "real" top k=5 before they go into the LLM (a quality-vs-latency trade-off).
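
For the hybrid search mentioned above, one common, framework-free way to merge a keyword ranking and a vector ranking is reciprocal rank fusion (RRF); a minimal sketch with hypothetical chunk IDs:

def reciprocal_rank_fusion(keyword_ids, vector_ids, k=60, top_n=5):
    """Merge two ranked ID lists; each hit scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical result lists from BM25 (keyword) and from the vector index:
bm25_hits = ["chunk-7", "chunk-3", "chunk-12"]
vector_hits = ["chunk-3", "chunk-9", "chunk-7"]
print(reciprocal_rank_fusion(bm25_hits, vector_hits))
# -> chunk-3 and chunk-7 rise to the top because both searches found them.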